Kwanza Tukule Data Analysis¶
This report showcases the completion of the Kwanza Tukule Data Analyst Assessment, designed to evaluate my technical, analytical, and problem-solving skills. Using the provided anonymized sales dataset, I applied data cleaning, preparation, and exploratory analysis techniques to uncover actionable insights. The analysis includes identifying trends, customer segmentation, forecasting, and anomaly detection to address key business challenges. Strategic recommendations are derived from the findings, focusing on product performance, customer retention, and operational efficiency. Additionally, I developed a dashboard to visualize key metrics, enabling better decision-making. This report demonstrates my ability to analyze data effectively and present meaningful insights in a structured, professional manner.
Importing Libraries¶
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline
# HTML export
import plotly.io as pio
pio.renderers.default = 'notebook'
Loading Dataset¶
df = pd.read_excel(r'Case Study Data - Read Only.xlsx')
df
| DATE | ANONYMIZED CATEGORY | ANONYMIZED PRODUCT | ANONYMIZED BUSINESS | ANONYMIZED LOCATION | QUANTITY | UNIT PRICE | |
|---|---|---|---|---|---|---|---|
| 0 | 2024-08-18 21:32:00 | Category-106 | Product-21f4 | Business-de42 | Location-1ba8 | 1 | 850.0 |
| 1 | 2024-08-18 21:32:00 | Category-120 | Product-4156 | Business-de42 | Location-1ba8 | 2 | 1910.0 |
| 2 | 2024-08-18 21:32:00 | Category-121 | Product-49bd | Business-de42 | Location-1ba8 | 1 | 3670.0 |
| 3 | 2024-08-18 21:32:00 | Category-76 | Product-61dd | Business-de42 | Location-1ba8 | 1 | 2605.0 |
| 4 | 2024-08-18 21:32:00 | Category-119 | Product-66e0 | Business-de42 | Location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333400 | 2024-01-09 20:49:00 | Category-119 | Product-e98d | Business-f9ff | Location-1979 | 1 | 1770.0 |
| 333401 | 2024-08-20 20:24:00 | Category-106 | Product-21f4 | Business-72bc | Location-689f | 1 | 850.0 |
| 333402 | 2024-08-20 20:24:00 | Category-76 | Product-6e9c | Business-72bc | Location-689f | 1 | 2575.0 |
| 333403 | 2024-10-10 18:19:00 | Category-91 | Product-523e | Business-2693 | Location-b27b | 1 | 1520.0 |
| 333404 | 2024-10-10 18:19:00 | Category-75 | Product-b31e | Business-2693 | Location-b27b | 1 | 4420.0 |
333405 rows × 7 columns
Attributes¶
DATE: Represents the date on which the transaction or activity occurred.
ANONYMIZED CATEGORY: Categorization of the product or service involved in the transaction, with the actual category names replaced by anonymized labels.
ANONYMIZED PRODUCT: Refers to the specific product associated with the transaction, with product names anonymized for confidentiality.
ANONYMIZED BUSINESS: Represents the business entity involved in the transaction.
ANONYMIZED LOCATION: The location where the transaction or activity occurred.
QUANTITY: The quantity of the product involved in the transaction, recorded as an integer.
UNIT PRICE: The price per unit of the product.
Data Cleaning and Preparation¶
# make column names and values uniform
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = df.dtypes[df.dtypes == 'object'].index
for c in categorical_columns:
df[c] = df[c].str.lower().str.replace(' ', '_')
df
| date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | |
|---|---|---|---|---|---|---|---|
| 0 | 2024-08-18 21:32:00 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 |
| 1 | 2024-08-18 21:32:00 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 |
| 2 | 2024-08-18 21:32:00 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 |
| 3 | 2024-08-18 21:32:00 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 |
| 4 | 2024-08-18 21:32:00 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333400 | 2024-01-09 20:49:00 | category-119 | product-e98d | business-f9ff | location-1979 | 1 | 1770.0 |
| 333401 | 2024-08-20 20:24:00 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 |
| 333402 | 2024-08-20 20:24:00 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 |
| 333403 | 2024-10-10 18:19:00 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 |
| 333404 | 2024-10-10 18:19:00 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 |
333405 rows × 7 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 333405 entries, 0 to 333404 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 333405 non-null datetime64[ns] 1 anonymized_category 333405 non-null object 2 anonymized_product 333405 non-null object 3 anonymized_business 333405 non-null object 4 anonymized_location 333405 non-null object 5 quantity 333405 non-null int64 6 unit_price 333397 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(1), object(4) memory usage: 17.8+ MB
# duplicates
df.duplicated().sum()
3524
There are 3524 duplicate rows; we drop them so repeated records do not inflate sales totals.
#dropping duplicates
df = df.drop_duplicates()
df.isna().sum()
date 0 anonymized_category 0 anonymized_product 0 anonymized_business 0 anonymized_location 0 quantity 0 unit_price 8 dtype: int64
The column 'unit_price' has 8 null values.
df[df.unit_price.isna()]
| date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | |
|---|---|---|---|---|---|---|---|
| 108112 | 2024-07-03 17:53:00 | category-94 | product-3d7f | business-4fce | location-f37d | 2 | NaN |
| 150961 | 2024-12-16 18:33:00 | category-79 | product-dfc8 | business-8bbf | location-3fc0 | 1 | NaN |
| 151142 | 2024-12-22 14:42:00 | category-122 | product-15e0 | business-c575 | location-1979 | 3 | NaN |
| 272379 | 2024-06-27 12:15:00 | category-92 | product-ccbc | business-14b6 | location-1979 | 1 | NaN |
| 278284 | 2024-08-14 21:09:00 | category-101 | product-84a5 | business-4be1 | location-bb69 | 21 | NaN |
| 278384 | 2024-12-30 14:17:00 | category-95 | product-15f3 | business-1a74 | location-f37d | 1 | NaN |
| 310385 | 2024-03-31 14:03:00 | category-114 | product-9204 | business-c9dc | location-689f | 1 | NaN |
| 327152 | 2024-08-13 16:20:00 | category-107 | product-7eed | business-0d61 | location-1ba8 | 1 | NaN |
Let's check whether the anonymized product determines the unit price, so that we could fill the nulls from matching rows
df_product_3d7fdf = df[df['anonymized_product'] == 'product-3d7f']
df_product_3d7fdf
| date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | |
|---|---|---|---|---|---|---|---|
| 196 | 2024-09-25 13:05:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| 287 | 2024-11-13 19:36:00 | category-94 | product-3d7f | business-7f77 | location-8959 | 1 | 860.0 |
| 319 | 2024-11-01 16:54:00 | category-94 | product-3d7f | business-2b91 | location-3fc0 | 1 | 860.0 |
| 372 | 2024-07-11 11:20:00 | category-94 | product-3d7f | business-2b24 | location-66f4 | 5 | 875.0 |
| 447 | 2024-12-09 19:35:00 | category-94 | product-3d7f | business-1b52 | location-7f37 | 1 | 885.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 332771 | 2024-10-15 20:01:00 | category-94 | product-3d7f | business-0ea3 | location-4ea1 | 1 | 860.0 |
| 332802 | 2024-10-21 19:04:00 | category-94 | product-3d7f | business-5415 | location-128a | 1 | 860.0 |
| 332842 | 2024-09-20 13:43:00 | category-94 | product-3d7f | business-5760 | location-689f | 1 | 860.0 |
| 333002 | 2024-10-08 17:15:00 | category-94 | product-3d7f | business-8603 | location-b27b | 1 | 860.0 |
| 333188 | 2024-11-08 21:20:00 | category-94 | product-3d7f | business-8bbf | location-3fc0 | 1 | 860.0 |
3164 rows × 7 columns
df_product_3d7fdf['unit_price'].unique()
array([860., 875., 885., 870., nan])
For 'product-3d7f' there are four distinct unit prices: 860.0, 875.0, 885.0 and 870.0, so the product alone does not determine a single price
df_product_3d7fdf[df_product_3d7fdf['unit_price'] == 860.0]
| date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | |
|---|---|---|---|---|---|---|---|
| 196 | 2024-09-25 13:05:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| 287 | 2024-11-13 19:36:00 | category-94 | product-3d7f | business-7f77 | location-8959 | 1 | 860.0 |
| 319 | 2024-11-01 16:54:00 | category-94 | product-3d7f | business-2b91 | location-3fc0 | 1 | 860.0 |
| 565 | 2024-09-15 15:52:00 | category-94 | product-3d7f | business-f749 | location-3fc0 | 2 | 860.0 |
| 780 | 2024-10-03 15:08:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 332771 | 2024-10-15 20:01:00 | category-94 | product-3d7f | business-0ea3 | location-4ea1 | 1 | 860.0 |
| 332802 | 2024-10-21 19:04:00 | category-94 | product-3d7f | business-5415 | location-128a | 1 | 860.0 |
| 332842 | 2024-09-20 13:43:00 | category-94 | product-3d7f | business-5760 | location-689f | 1 | 860.0 |
| 333002 | 2024-10-08 17:15:00 | category-94 | product-3d7f | business-8603 | location-b27b | 1 | 860.0 |
| 333188 | 2024-11-08 21:20:00 | category-94 | product-3d7f | business-8bbf | location-3fc0 | 1 | 860.0 |
2344 rows × 7 columns
Let's check whether the anonymized category pins down the price 860.0
df_product_3d7fdf[df_product_3d7fdf['anonymized_category'] == 'category-94']
| date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | |
|---|---|---|---|---|---|---|---|
| 196 | 2024-09-25 13:05:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| 287 | 2024-11-13 19:36:00 | category-94 | product-3d7f | business-7f77 | location-8959 | 1 | 860.0 |
| 319 | 2024-11-01 16:54:00 | category-94 | product-3d7f | business-2b91 | location-3fc0 | 1 | 860.0 |
| 372 | 2024-07-11 11:20:00 | category-94 | product-3d7f | business-2b24 | location-66f4 | 5 | 875.0 |
| 447 | 2024-12-09 19:35:00 | category-94 | product-3d7f | business-1b52 | location-7f37 | 1 | 885.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 332771 | 2024-10-15 20:01:00 | category-94 | product-3d7f | business-0ea3 | location-4ea1 | 1 | 860.0 |
| 332802 | 2024-10-21 19:04:00 | category-94 | product-3d7f | business-5415 | location-128a | 1 | 860.0 |
| 332842 | 2024-09-20 13:43:00 | category-94 | product-3d7f | business-5760 | location-689f | 1 | 860.0 |
| 333002 | 2024-10-08 17:15:00 | category-94 | product-3d7f | business-8603 | location-b27b | 1 | 860.0 |
| 333188 | 2024-11-08 21:20:00 | category-94 | product-3d7f | business-8bbf | location-3fc0 | 1 | 860.0 |
3164 rows × 7 columns
Even restricting 'product-3d7f' to 'category-94', the price still varies, so there is no supporting attribute from which to impute the missing unit prices.
Hence, it is safe to drop the nulls
# dropping nulls
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 329873 entries, 0 to 333404 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 329873 non-null datetime64[ns] 1 anonymized_category 329873 non-null object 2 anonymized_product 329873 non-null object 3 anonymized_business 329873 non-null object 4 anonymized_location 329873 non-null object 5 quantity 329873 non-null int64 6 unit_price 329873 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(1), object(4) memory usage: 20.1+ MB
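Dropping is the simplest option here; an alternative worth noting (not applied in this analysis) is imputing each missing price with the product's most common observed price. A minimal sketch, using a hypothetical helper and made-up demo rows:

```python
import pandas as pd

def fill_price_with_product_mode(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing unit_price with the product's modal (most frequent) price."""
    out = frame.copy()
    # Modal price per product, computed from the non-null rows only
    modal_price = out.groupby('anonymized_product')['unit_price'].agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA
    )
    out['unit_price'] = out['unit_price'].fillna(
        out['anonymized_product'].map(modal_price)
    )
    return out

# Tiny illustration with invented rows
demo = pd.DataFrame({
    'anonymized_product': ['product-3d7f'] * 3 + ['product-x'],
    'unit_price': [860.0, 860.0, None, 500.0],
})
filled = fill_price_with_product_mode(demo)
```

Since 'product-3d7f' mostly sells at 860.0 in the real data, mode-based imputation would have been a defensible fallback if more than a handful of rows were affected.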
Summary of Issues Identified and Steps Taken to Resolve Them:¶
- Duplicate Records:
- Issue Identified: The dataset initially contained 3524 duplicate records, which could skew the analysis and affect the accuracy of any insights derived from the data.
- Steps Taken: The duplicates were identified and dropped from the dataset using the .duplicated().sum() method and .drop_duplicates() to ensure each record is unique and accurately represents individual sales transactions.
- Null Values in 'unit_price' Column:
- Issue Identified: There were 8 null values in the unit_price column, which is critical as the price of the product must be available for proper calculations of sales value and other analyses.
- Steps Taken: The null values in the unit_price column were identified using .isnull().sum() and subsequently dropped from the dataset using .dropna() to remove any incomplete records that might lead to errors in subsequent analysis.
Final Dataset: After addressing these issues, the dataset now contains 329,873 non-null records, with no duplicate entries and all null values removed from the unit_price column. The dataset's columns now have consistent data types and no missing or redundant data, making it ready for further analysis.
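The cleaning steps summarized above could be consolidated into one reusable function; this is a sketch with invented demo data, not the notebook's actual pipeline:

```python
import pandas as pd

def clean_sales(frame: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names/values, then drop duplicates and null rows."""
    out = frame.copy()
    out.columns = out.columns.str.lower().str.replace(' ', '_')
    # Lowercase and underscore every text column for uniformity
    for c in out.select_dtypes(include='object').columns:
        out[c] = out[c].str.lower().str.replace(' ', '_')
    return out.drop_duplicates().dropna()

# Invented three-row example: one duplicate, one null price
demo = pd.DataFrame({
    'ANONYMIZED PRODUCT': ['Product-A', 'Product-A', 'Product-B'],
    'UNIT PRICE': [100.0, 100.0, None],
})
cleaned = clean_sales(demo)
```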
Feature Engineering¶
Creating the column: “Month-Year” (e.g., August 2024) from the “DATE” column.
df
| date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | |
|---|---|---|---|---|---|---|---|
| 0 | 2024-08-18 21:32:00 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 |
| 1 | 2024-08-18 21:32:00 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 |
| 2 | 2024-08-18 21:32:00 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 |
| 3 | 2024-08-18 21:32:00 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 |
| 4 | 2024-08-18 21:32:00 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | 2024-11-13 13:59:00 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 |
| 333401 | 2024-08-20 20:24:00 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 |
| 333402 | 2024-08-20 20:24:00 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 |
| 333403 | 2024-10-10 18:19:00 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 |
| 333404 | 2024-10-10 18:19:00 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 |
329873 rows × 7 columns
# since column 'date' is already a datetime object,
# ------------Create the 'Month-Year' column---------------#
df2 = df.copy() # creating a copy for original cleaned df
# Create a 'Month-Year' column, keeping the date as the first of the month
df2['Month-Year'] = df['date'].dt.to_period('M').dt.to_timestamp()
# Format it for display, but still keep the datetime
df2['Month-Year'] = df2['Month-Year'].dt.strftime('%B %Y')
# Drop the 'date' column
df2.drop(columns=['date'], inplace=True)
# Move 'Month-Year' column to the first position
columns = ['Month-Year'] + [col for col in df2.columns if col != 'Month-Year']
df2 = df2[columns]
# Display the updated DataFrame to check the new column
df2
| Month-Year | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | |
|---|---|---|---|---|---|---|---|
| 0 | August 2024 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 |
| 1 | August 2024 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 |
| 2 | August 2024 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 |
| 3 | August 2024 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 |
| 4 | August 2024 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | November 2024 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 |
| 333401 | August 2024 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 |
| 333402 | August 2024 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 |
| 333403 | October 2024 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 |
| 333404 | October 2024 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 |
329873 rows × 7 columns
EDA (Exploratory Data Analysis)¶
Sales Overview¶
# Calculate the 'Value' column (quantity × unit_price)
df3 = df2.copy()
df3['Value'] = df3['quantity'] * df3['unit_price']
df3
| Month-Year | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | Value | |
|---|---|---|---|---|---|---|---|---|
| 0 | August 2024 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 | 850.0 |
| 1 | August 2024 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 | 3820.0 |
| 2 | August 2024 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 | 3670.0 |
| 3 | August 2024 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 | 2605.0 |
| 4 | August 2024 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 | 7400.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | November 2024 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 | 6800.0 |
| 333401 | August 2024 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 | 850.0 |
| 333402 | August 2024 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 | 2575.0 |
| 333403 | October 2024 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 | 1520.0 |
| 333404 | October 2024 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 | 4420.0 |
329873 rows × 8 columns
Calculating the sales for each unique category
# Group by anonymized_category
category_sales = df3.groupby('anonymized_category').agg(
Total_Quantity=('quantity', 'sum'),
Total_Value=('Value', 'sum')
).reset_index()
category_sales
| anonymized_category | Total_Quantity | Total_Value | |
|---|---|---|---|
| 0 | category-100 | 76824 | 134902751.0 |
| 1 | category-101 | 19564 | 35577822.0 |
| 2 | category-102 | 1786 | 464463.0 |
| 3 | category-104 | 1217 | 1557598.0 |
| 4 | category-105 | 1579 | 2690719.0 |
| 5 | category-106 | 6521 | 5932763.0 |
| 6 | category-107 | 2729 | 4170797.0 |
| 7 | category-108 | 9756 | 5101375.0 |
| 8 | category-109 | 1446 | 1263226.0 |
| 9 | category-110 | 10529 | 5483386.0 |
| 10 | category-111 | 6715 | 4387343.0 |
| 11 | category-113 | 741 | 1254083.0 |
| 12 | category-114 | 3 | 8600.0 |
| 13 | category-115 | 348 | 425360.0 |
| 14 | category-116 | 856 | 422745.0 |
| 15 | category-117 | 5 | 1550.0 |
| 16 | category-118 | 21 | 7560.0 |
| 17 | category-119 | 68332 | 103454819.0 |
| 18 | category-120 | 169715 | 319178743.0 |
| 19 | category-121 | 14669 | 22327643.0 |
| 20 | category-122 | 1223 | 3493480.0 |
| 21 | category-123 | 286 | 730730.0 |
| 22 | category-124 | 4 | 10060.0 |
| 23 | category-125 | 123 | 297060.0 |
| 24 | category-74 | 941 | 1927871.0 |
| 25 | category-75 | 151330 | 544658700.0 |
| 26 | category-76 | 71719 | 344939553.0 |
| 27 | category-77 | 28455 | 76741382.0 |
| 28 | category-78 | 9766 | 9792609.0 |
| 29 | category-79 | 2215 | 1184953.0 |
| 30 | category-81 | 142 | 72061.0 |
| 31 | category-82 | 4759 | 3930818.0 |
| 32 | category-83 | 2436 | 4039483.0 |
| 33 | category-84 | 11933 | 6798158.0 |
| 34 | category-85 | 22997 | 33762533.0 |
| 35 | category-86 | 8 | 3320.0 |
| 36 | category-89 | 238 | 136850.0 |
| 37 | category-90 | 15 | 15750.0 |
| 38 | category-91 | 20853 | 44152103.0 |
| 39 | category-92 | 6953 | 10468723.0 |
| 40 | category-94 | 23668 | 16750815.0 |
| 41 | category-95 | 4116 | 7466932.0 |
| 42 | category-96 | 1427 | 2249424.0 |
| 43 | category-97 | 2711 | 2628309.0 |
| 44 | category-98 | 2152 | 2519695.0 |
| 45 | category-99 | 1964 | 1589480.0 |
# graphing total quantity and total value per anonymized category
import plotly.graph_objects as go
# bar chart for Total Quantity
bar = go.Bar(
x=category_sales["anonymized_category"],
y=category_sales["Total_Quantity"],
name="Total Quantity",
marker=dict(color="steelblue")
)
# line chart for Total Value
line = go.Scatter(
x=category_sales["anonymized_category"],
y=category_sales["Total_Value"],
name="Total Value",
mode="lines+markers", # Line with markers
line=dict(color="firebrick", width=3), # Line color and width
marker=dict(size=8), # Marker size
yaxis="y2" # Assign to secondary y-axis
)
# Combine both traces
fig = go.Figure(data=[bar, line])
# Update layout for dual y-axes
fig.update_layout(
title="Sales Overview by Category",
xaxis_title="Category",
yaxis=dict(
    title=dict(text="Total Quantity", font=dict(color="steelblue")),  # Match bar chart color
    tickfont=dict(color="steelblue")  # Match bar chart color
),
yaxis2=dict(
    title=dict(text="Total Value", font=dict(color="firebrick")),  # Match line chart color
    tickfont=dict(color="firebrick"),  # Match line chart color
    overlaying="y",  # Overlay on the same plot
    side="right"  # Position on the right
),
legend_title="Metrics",
barmode="group", # Group bars together
template="plotly_white" # Cleaner theme
)
# Show the figure
fig.show()
∴ From the combined bar-and-line plot, 'category-120' leads on volume, with over 169.7k units sold.
By total value, however, 'category-75' leads, generating over 544.6M in revenue
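These readings can also be confirmed programmatically with idxmax; the sketch below uses a three-row subset built from the figures in the category_sales table above, not the full table:

```python
import pandas as pd

# Three-row subset (values taken from the category_sales table above)
category_sales_demo = pd.DataFrame({
    'anonymized_category': ['category-75', 'category-120', 'category-76'],
    'Total_Quantity': [151330, 169715, 71719],
    'Total_Value': [544658700.0, 319178743.0, 344939553.0],
})
# Category with the largest quantity sold, and the one with the largest revenue
top_by_quantity = category_sales_demo.loc[
    category_sales_demo['Total_Quantity'].idxmax(), 'anonymized_category']
top_by_value = category_sales_demo.loc[
    category_sales_demo['Total_Value'].idxmax(), 'anonymized_category']
```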
Calculating the sales for each unique business.
# Group by anonymized_business
business_sales = df3.groupby('anonymized_business').agg(
Total_Quantity=('quantity', 'sum'),
Total_Value=('Value', 'sum')
).reset_index()
business_sales
| anonymized_business | Total_Quantity | Total_Value | |
|---|---|---|---|
| 0 | business-0000 | 8 | 10445.0 |
| 1 | business-0005 | 1 | 2645.0 |
| 2 | business-0029 | 26 | 77340.0 |
| 3 | business-003d | 98 | 221761.0 |
| 4 | business-0072 | 127 | 225056.0 |
| ... | ... | ... | ... |
| 4795 | business-ffa9 | 3 | 6740.0 |
| 4796 | business-ffae | 6 | 10530.0 |
| 4797 | business-ffb1 | 266 | 438115.0 |
| 4798 | business-ffd2 | 37 | 67723.0 |
| 4799 | business-ffff | 110 | 110285.0 |
4800 rows × 3 columns
For the sales-per-business comparison, two separate charts are appropriate given the large number of businesses (4,800) being plotted
# Line chart for Total Value (green); go.Scatter replaces the deprecated go.Line
value_trace = go.Scatter(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Value"],
    name="Total Value",
    mode="lines",
    line=dict(color="green"),
)
# layout
layout_value = go.Layout(
    title="Total Value by Business",
    xaxis_title="Business",
    yaxis_title="Total Value",
    template="plotly_white",
)
# figure
fig_value = go.Figure(data=[value_trace], layout=layout_value)
# Show the plot
fig_value.show()
∴ From the line plot, 'business-978e' has the highest sales value, with over 28M in revenue.
# Line chart for Total Quantity (blue); go.Scatter replaces the deprecated go.Line
quantity_trace = go.Scatter(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Quantity"],
    name="Total Quantity",
    mode="lines",
    line=dict(color="blue"),
)
# layout
layout_quantity = go.Layout(
    title="Total Quantity by Business",
    xaxis_title="Business",
    yaxis_title="Total Quantity",
    template="plotly_white",
)
# figure
fig_quantity = go.Figure(data=[quantity_trace], layout=layout_quantity)
# Show the plot
fig_quantity.show()
∴ From the line plot, 'business-978e' also leads on volume, with over 13.9k units sold.
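A quick way to confirm the top performers without reading the chart is nlargest; the sketch below uses a made-up three-row subset rather than the full business_sales table:

```python
import pandas as pd

# Made-up three-row subset; the real business_sales has 4,800 rows
business_sales_demo = pd.DataFrame({
    'anonymized_business': ['business-978e', 'business-ffb1', 'business-0072'],
    'Total_Quantity': [13900, 266, 127],
    'Total_Value': [28_000_000.0, 438_115.0, 225_056.0],
})
# Top businesses by revenue, highest first
top3 = business_sales_demo.nlargest(3, 'Total_Value')
```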
Trends Over Time¶
To create a time series plot, we need a datetime variable
# Create a copy of df
df4 = df.copy()
# Add 'Value' column
df4['value'] = df4['quantity'] * df4['unit_price']
# Create a new column 'Month-Year' in df4 (not df)
df4['month-year'] = df4['date'].dt.to_period('M').dt.to_timestamp()
# Format 'Month-Year' to display as "August 2024"
df4['month-year-formatted'] = df4['month-year'].dt.strftime('%B %Y')
# Reorder columns to make 'Month-Year-Formatted' the first column
df4 = df4[['month-year-formatted'] + [col for col in df4.columns if col != 'month-year-formatted']]
df4.info()
<class 'pandas.core.frame.DataFrame'> Index: 329873 entries, 0 to 333404 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month-year-formatted 329873 non-null object 1 date 329873 non-null datetime64[ns] 2 anonymized_category 329873 non-null object 3 anonymized_product 329873 non-null object 4 anonymized_business 329873 non-null object 5 anonymized_location 329873 non-null object 6 quantity 329873 non-null int64 7 unit_price 329873 non-null float64 8 value 329873 non-null float64 9 month-year 329873 non-null datetime64[ns] dtypes: datetime64[ns](2), float64(2), int64(1), object(5) memory usage: 27.7+ MB
Our datetime is floored to the first of the month for aggregation; for instance, any date in January 2024 is recorded as Jan 1, 2024
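A minimal sketch of how to_period('M').dt.to_timestamp() floors timestamps to month starts (illustrative dates only):

```python
import pandas as pd

# Every timestamp in a month collapses to that month's first day at midnight
dates = pd.Series(pd.to_datetime([
    '2024-01-09 20:49:00', '2024-01-25 08:00:00', '2024-08-18 21:32:00',
]))
month_start = dates.dt.to_period('M').dt.to_timestamp()
labels = month_start.dt.strftime('%B %Y')  # e.g. "August 2024"
```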
# Group the data by 'Month-Year' and calculate the monthly average of 'Quantity' and 'Value'
monthly_avg_sales = df4.groupby('month-year').agg({'quantity': 'mean', 'value': 'mean'}).reset_index()
# Convert 'Month-Year' to a string for plotting purposes
monthly_avg_sales['month-year'] = monthly_avg_sales['month-year'].astype(str)
# Create a line plot with dual y-axes
fig = go.Figure()
# Add the line for Value
fig.add_trace(go.Scatter(
x=monthly_avg_sales['month-year'],
y=monthly_avg_sales['value'],
name='Average Value',
line=dict(color='red'),
yaxis='y1'
))
# Add the line for Quantity
fig.add_trace(go.Scatter(
x=monthly_avg_sales['month-year'],
y=monthly_avg_sales['quantity'],
name='Average Quantity',
line=dict(color='green'),
yaxis='y2'
))
# Update layout for dual y-axes
fig.update_layout(
title="Monthly Averages of Sales (Quantity and Value)",
xaxis=dict(title="Month-Year"),
yaxis=dict(
    title=dict(text="Average Value", font=dict(color="red")),  # Match line color
    tickfont=dict(color="red"),
),
yaxis2=dict(
    title=dict(text="Average Quantity", font=dict(color="green")),  # Match line color
    tickfont=dict(color="green"),
    anchor="x",
    overlaying="y",
    side="right",
),
legend=dict(x=0.5, y=1.1, orientation="h"),
)
# Show the plot
fig.show()
Monthly averages are useful for understanding typical performance per month, especially when outliers or uneven transaction counts distort totals.
For the monthly average sales (both quantity and value), there is a downward slope, indicating a decreasing trend over time.
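The visual downward slope could also be quantified with a simple linear fit; this sketch uses invented monthly averages, not the real series:

```python
import numpy as np
import pandas as pd

# Illustrative monthly averages (invented numbers, not the actual data)
monthly_avg_demo = pd.Series(
    [5200.0, 5100.0, 4950.0, 4800.0, 4700.0],
    index=pd.period_range('2024-01', periods=5, freq='M'),
)
# Fit value ~ slope*t + intercept over month index 0..n-1;
# a negative slope confirms a declining trend
slope, intercept = np.polyfit(
    np.arange(len(monthly_avg_demo)), monthly_avg_demo.to_numpy(), 1
)
```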
# Group the data by 'Month-Year' and calculate the monthly sum of 'Quantity' and 'Value'
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()
# Convert 'Month-Year' to a string for plotting purposes
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)
# Create a line plot with dual y-axes
fig = go.Figure()
# Add the line for Value
fig.add_trace(go.Scatter(
x=monthly_sum_sales['month-year'],
y=monthly_sum_sales['value'],
name='Total Sum Value',
line=dict(color='red'),
yaxis='y1'
))
# Add the line for Quantity
fig.add_trace(go.Scatter(
x=monthly_sum_sales['month-year'],
y=monthly_sum_sales['quantity'],
name='Total Sum Quantity',
line=dict(color='green'),
yaxis='y2'
))
# Update layout for dual y-axes
fig.update_layout(
title="Monthly Sum of Sales (Quantity and Value)",
xaxis=dict(title="Month-Year"),
yaxis=dict(
    title=dict(text="Total Value", font=dict(color="red")),  # Match line color
    tickfont=dict(color="red"),
),
yaxis2=dict(
    title=dict(text="Total Quantity", font=dict(color="green")),  # Match line color
    tickfont=dict(color="green"),
    anchor="x",
    overlaying="y",
    side="right",
),
legend=dict(x=0.5, y=1.1, orientation="h"),
)
# Show the plot
fig.show()
To understand overall monthly performance, aggregate the data by summing the sales for each month.
The trend shows visible fluctuations, with peaks in certain months (May, July and October), which might suggest seasonal effects.
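One way to surface the peak months programmatically is month-over-month percentage change; the numbers below are invented for illustration only:

```python
import pandas as pd

# Illustrative monthly totals (invented), April through July
monthly_total_demo = pd.Series(
    [100.0, 120.0, 90.0, 130.0],
    index=pd.period_range('2024-04', periods=4, freq='M'),
)
# Percent change versus the prior month; positive values are peaks
mom_change = monthly_total_demo.pct_change() * 100
peaks = mom_change[mom_change > 0].index.strftime('%B')
```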
Performance analysis¶
# top 5 most frequently purchased products (based on Quantity).
# Group by product and sum the quantities
top_products = df4.groupby('anonymized_product')['quantity'].sum().reset_index()
# Sort in descending order and select the top 5
top_5_products = top_products.sort_values(by='quantity', ascending=False).head(5)
# Display the result
top_5_products
| anonymized_product | quantity | |
|---|---|---|
| 338 | product-66e0 | 46957 |
| 750 | product-e805 | 42602 |
| 476 | product-8f75 | 37566 |
| 127 | product-29ee | 35940 |
| 213 | product-4156 | 28487 |
# Create a bar graph
fig = px.bar(
top_5_products,
x='anonymized_product',
y='quantity',
title='Top 5 Most Purchased Products (Quantity)',
labels={'anonymized_product': 'Product', 'quantity': 'Total Quantity'},
text='quantity' # Display the quantity on the bars
)
# Customize the layout
fig.update_layout(
xaxis_title='Product',
yaxis_title='Total Quantity',
template='plotly_white'
)
# Show the plot
fig.show()
# top 5 most valuable products (based on Value).
# Group by product and sum the values
top_value_products = df4.groupby('anonymized_product')['value'].sum().reset_index()
# Sort in descending order and select the top 5
top_value_products = top_value_products.sort_values(by='value', ascending=False).head(5)
# Display the result
top_value_products
| anonymized_product | value | |
|---|---|---|
| 750 | product-e805 | 262787281.0 |
| 476 | product-8f75 | 158797460.0 |
| 338 | product-66e0 | 70704225.0 |
| 127 | product-29ee | 68248274.0 |
| 213 | product-4156 | 56956007.0 |
# Create a bar graph
fig = px.bar(
top_value_products,
x='anonymized_product',
y='value',
title='Top 5 Most Valuable Products (Value)',
labels={'anonymized_product': 'Product', 'value': 'Total Value'},
text='value'  # Display the value on the bars
)
# Customize the layout
fig.update_layout(
xaxis_title='Product',
yaxis_title='Total Value',
template='plotly_white'
)
# Show the plot
fig.show()